MD5-related optimizations #6
Closed
Chainfire wants to merge 3 commits into
Works just as well, prevents having to repeat them across files
MD5 hashes computed during rsync's block matching phase are independent and thus possible to process in parallel. This code processes 4 blocks in parallel if SSE2 is available, or 8 if AVX2 is available. An increase of performance (or decrease of CPU usage) of up to 6x has been measured. A prefetching algorithm is used to predict and load upcoming blocks, as this prevents the need for extensive modifications to other parts of the rsync sources to get this working.
Splits the input up into 8 independent streams (64-byte interleave), and produces a final checksum based on the end state of those 8 streams. If parallelization of MD5 hashing is available, the performance gain is 2x to 6x. xxHash is still preferred (and faster), but this provides a reasonably fast fallback for the case where xxHash libraries are not available at build time.
Member
Thanks! I've put the changes into a file named "md5p8.diff" in the rsync-patches repo for now. I incorporated some of the changes that put more info into lib/mdigest.h, and I tweaked a few things for style and to fix a compiler warning. Here's the resulting patch:
Member
I'm going to leave it as a maintained patch for now and consider merging it later.
Contributor
Author
I'll update this to include the new build tests and to apply against the latest master.
Trogious
added a commit
to Trogious/rsync
that referenced
this pull request
May 14, 2026
rsync.exe -av <local> user@host:/dst/ now transfers files over SSH with byte-exact verification. Idempotent re-push transfers 0 bytes. Four fixes to clear the runtime path after build came up: * win32/win_select.c: select() shim. winsock's select() only handles SOCKETs; rsync's io.c calls select() on the pipe fds from piped_child. Classify each fd via GetFileType+GetNamedPipeInfo, defer sockets to real winsock select, poll pipes via PeekNamedPipe. 10 ms cadence. ~170 LOC. * win_spawn.c: bump CreatePipe buffer hint to 1 MB so the file-list phase doesn't deadlock on a full 4 KB anonymous pipe. * util1.c::change_dir: treat 'X:\…', 'X:/…', and '\…' as absolute on Windows. Normalize curr_dir to forward slashes after getcwd so path joins don't mix separators. * syscall.c::do_open_nofollow: force O_BINARY (MSVC defaults to text mode); skip the lstat→open→fstat dev/ino symlink-race check on Windows because MSVC's stat/fstat don't return stable values for those fields. Pull and local-copy still hit the RtlCloneUserProcess fork hang — tracked as task RsyncProject#6. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tridge
added a commit
to tridge/rsync
that referenced
this pull request
May 20, 2026
…ncProject#6 (chmod mode arg) These two small syscall.c fixes were made at the start of the round-4 work but got dropped on the floor when I split the commit -- only the docs (RsyncProject#1) and the bigger RsyncProject#2/RsyncProject#4/RsyncProject#5 deferred-immutable-dir series ended up landed. The tree was left dirty. RsyncProject#3: do_rename (the non-_at variant) was missing the hardlink-aware restore I added to do_rename_at last round. Same shape -- when renameat replaces a destination inode that had st_nlink > 1, the remaining hardlinks survive carrying the cleared flags. Restore via new_fd before close (the fd still refers to the surviving inode). RsyncProject#6: do_chmod and do_chmod_at force_change recovery were calling make_mutable_fd(fd, mode, ...) where mode was the caller-supplied chmod-target mode -- some callers (notably xattrs.c's set_xattr recovery path) pass perm bits only, no S_IFREG / S_IFDIR, so on Linux rsync_fchflags rejects the call as neither regular file nor directory and recovery silently fails. Use st.st_mode from the freshly-fstatted target instead, which always has the right S_IFx bits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is a 3-parter. The first commit moves the OpenSSL-related defines from checksum.c to mdigest.h, because I found myself copy/pasting them more than once otherwise.
The second commit enables parallel computation of MD5 hashes in the block matching get_checksum2() phase. As each block's hash is independent, we can process up to 4 blocks simultaneously with SSE2, and 8 blocks with AVX2, leading to a real-world 2x to 6x performance gain / CPU usage reduction (even over OpenSSL-optimized MD5).
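To illustrate why independent block hashes can be batched, here is a minimal sketch in Python rather than rsync's C. It is not the SIMD code from this PR: worker threads stand in for SIMD lanes, `hashlib.md5` stands in for the vectorized MD5 core, and the names `split_blocks`, `md5_sequential`, and `md5_batched` are hypothetical. It only shows that hashing blocks in groups of 8 yields the same digests as hashing them one by one, and it is not expected to reproduce the measured speedup.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

LANES = 8  # AVX2 batch width named in the PR; the SSE2 path would use 4


def split_blocks(data, block_len):
    """Cut data into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_len] for i in range(0, len(data), block_len)]


def md5_sequential(blocks):
    """Reference path: hash one block after another."""
    return [hashlib.md5(b).digest() for b in blocks]


def md5_batched(blocks, lanes=LANES):
    """Hash blocks in groups of `lanes`; each group is processed concurrently.

    ASSUMPTION: threads are only a stand-in for SIMD lanes here.
    """
    digests = []
    with ThreadPoolExecutor(max_workers=lanes) as pool:
        for i in range(0, len(blocks), lanes):
            group = blocks[i:i + lanes]
            # pool.map preserves input order, so digests line up with blocks.
            digests.extend(pool.map(lambda b: hashlib.md5(b).digest(), group))
    return digests


data = bytes(range(256)) * 1024       # 256 KiB of sample data
blocks = split_blocks(data, 4096)     # 64 blocks of 4 KiB each
same = md5_batched(blocks) == md5_sequential(blocks)
```

Because no block's digest depends on any other block, the batched and sequential paths agree. That is also why the PR can state that compatibility with non-SIMD builds is maintained.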
However, to make this happen without significant changes to the rest of rsync's codebase, a block prefetcher had to be created. (This whole commit requires --enable-simd, like my previous contributions.) Full compatibility is maintained with non-SIMD counterparts.
The same mechanism could be used for multithreading checksums as well, but that is beyond the scope of this patch.
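The prefetching idea can be sketched roughly as follows, again in Python rather than rsync's C. The actual prediction logic in the patch is not described in this PR, so everything here is an assumption: the class `PrefetchingHasher` is hypothetical, and it simply guesses that the caller will ask for consecutive blocks. On a miss it hashes a batch of predicted blocks at once, so later requests are served from the cache.

```python
import hashlib


class PrefetchingHasher:
    """Hypothetical sketch: answer per-block MD5 requests from a predicted batch."""

    def __init__(self, data, block_len, lanes=8):
        self.data = data
        self.block_len = block_len
        self.lanes = lanes
        self.cache = {}    # offset -> digest for the most recent batch
        self.batches = 0   # number of batch computations performed

    def _hash_batch(self, offsets):
        # A SIMD build would hash these blocks side by side; this sketch loops.
        self.batches += 1
        return {
            off: hashlib.md5(self.data[off:off + self.block_len]).digest()
            for off in offsets
        }

    def checksum(self, offset):
        if offset not in self.cache:
            # Miss. ASSUMPTION: the caller walks forward one block at a time,
            # so predict the next lanes - 1 block offsets as well.
            ahead = (offset + i * self.block_len for i in range(1, self.lanes))
            predicted = [offset] + [o for o in ahead if o < len(self.data)]
            self.cache = self._hash_batch(predicted)
        return self.cache[offset]


data = b"x" * (16 * 1024)
hasher = PrefetchingHasher(data, block_len=1024)
digests = [hasher.checksum(off) for off in range(0, len(data), 1024)]
```

The caller still requests one block at a time, which is the point of the approach: the calling code is left unchanged, while the hasher gets whole batches to work on. A wrong prediction costs only a new batch, never a wrong result.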
Both an optimized parallel implementation (--enable-simd) and a reference C implementation are available. MD5P8 is slightly slower than MD5 on files under 10 kB due to the additional overhead, similar to MD5 on larger files without SIMD, and much faster on larger files with SIMD.
Note that get_checksum2() keeps using normal MD5 even if whole-file checksum is MD5P8, because that is parallelized with SIMD anyway if available, and using MD5P8 would just add overhead and quite probably be slower.
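The MD5P8 commit message describes splitting the input into 8 independent streams with a 64-byte interleave and deriving the final checksum from their end states. A rough Python sketch of that layout follows. The exact way the patch combines the 8 end states is not given here, so the final step (hashing the concatenated stream digests) is an assumption, and `md5p8_sketch` is a hypothetical name. Its output should not be expected to match the real MD5P8 checksum.

```python
import hashlib

STREAMS = 8
CHUNK = 64  # interleave granularity in bytes, equal to one MD5 input block


def md5p8_sketch(data):
    """Distribute 64-byte chunks round-robin over 8 MD5 states, then combine."""
    streams = [hashlib.md5() for _ in range(STREAMS)]
    for n, off in enumerate(range(0, len(data), CHUNK)):
        # Chunk n goes to stream n mod 8, so the streams never depend on
        # each other and could be advanced in parallel.
        streams[n % STREAMS].update(data[off:off + CHUNK])
    # ASSUMPTION: combine by hashing the 8 stream digests in stream order.
    final = hashlib.md5()
    for s in streams:
        final.update(s.digest())
    return final.digest()


checksum = md5p8_sketch(bytes(range(256)) * 64)  # 16 KiB of sample input
```

With this layout, each group of 8 consecutive 64-byte chunks advances all 8 states by one MD5 block. That matches the shape an 8-lane AVX2 implementation would want, and it explains the extra overhead on small files: all 8 states must be finalized even when most received little or no data.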
Further note that the CSUM_MD5 and CSUM_MD5P8 defines now appear in both checksum.c and simd-checksum-x86_64.cpp. They need to be kept in sync; perhaps they should be moved to a header?
Motivation: although xxhash is now available for rsync, it is an external dependency rather than part of the code itself. By my last evaluation, many distros do not yet ship xxhash, so the distro-provided rsync package will be built without xxhash support. That said, I can imagine this PR not being merged because it does not fit the direction rsync is moving in. I need it myself because of an uncommon build target, and the code might as well be available to everyone.
The parallel computation of MD5 hashes in get_checksum2() will benefit connections both to recent builds without xxhash and to older builds of rsync, whenever the block-matching phase applies. If it doesn't reduce transfer time because of connection or disk speed limitations, it will at least massively reduce CPU usage on the supporting client.
The use-case for MD5P8 is more limited, as its usefulness requires both ends to be running a supporting rsync build, but one end not supporting xxhash. If both ends do support xxhash, that should always be the preferred checksum (while MD5P8 can reach gigabytes per second, xxhash is still twice as fast). I only created it as it was a small effort now that parallel MD5 computation was available anyway, and it doesn't have any external dependencies.
I've done some benchmarks for transferring 1GB files between a fast and a slow CPU on 1GbE LAN, compared to normal MD5 usage (all tests already including my previous block size patches and get_checksum1() optimizations):
get_checksum2() MD5 parallelization with MD5 whole-file checksum, both files existing on both ends:
get_checksum2() MD5 parallelization and MD5P8 whole-file checksum, both files existing on both ends:
xxhash for both get_checksum2() and whole-file checksum, both files existing on both ends:
MD5P8, new file:
xxhash, new file:
MD5P8, local checksum:
xxhash, local checksum:
Obviously these are highly specific to my setup and YMMV. However, my daily syncing of TBs of data is now twice as fast, with average CPU usage down to less than a quarter. xxhash doesn't pull far ahead in this case because CPU power during checksumming is no longer the bottleneck after these patches. With even faster network and disks (10GbE + NVMe), xxhash might be twice as fast.